Graph spectra and the detectability of community structure in networks 






Raj Rao Nadakuditi 1 and M. E. J. Newman 2 

1 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 
2 Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI 48109 

We study networks that display community structure — groups of nodes within which connections 
are unusually dense. Using methods from random matrix theory, we calculate the spectra of such 
networks in the limit of large size, and hence demonstrate the presence of a phase transition in 
matrix methods for community detection, such as the popular modularity maximization method. 
The transition separates a regime in which such methods successfully detect the community structure 
from one in which the structure is present but is not detected. By comparing these results with recent 
analyses of maximum-likelihood methods we are able to show that spectral modularity maximization 
is an optimal detection method in the sense that no other method will succeed in the regime where 
the modularity method fails. 



The problem of community detection in networks has 
attracted a substantial amount of attention in recent 
years 0, 0]. Communities in this context are groups 
of vertices within a network that have a high density of 
within-group connections but a lower density of between- 
group connections. The challenge is to find such groups 
accurately and efficiently in a given network — the ability 
to do so would have applications in the analysis of ob- 
servational data, network visualization, and complexity 
reduction and parallelization of network problems. 

In this paper we focus on matrix methods for com- 
munity detection, which are based on the properties of 
matrix representations of networks such as the adjacency 
matrix or the modularity matrix. While significant effort 
has been devoted to the development of practical algo- 
rithms using these methods, there has been less work on 
formal examination of their properties and implications 
for algorithm performance. Here we give an analysis of 
the spectral properties of the adjacency and modularity 
matrices using random matrix methods, and in the pro- 
cess uncover a number of results of practical importance. 
Chief among these is the presence of a sharp transition 
between a regime in which the spectrum contains clear 
evidence of community structure and a regime in which 
it contains none. In the former regime, community de- 
tection is possible and current algorithms should perform 
well; in the latter, any method relying on the spectrum 
to perform structure detection must fail. A similar phase 
transition has been reported recently in an analysis of a 
different class of detection methods, based on Bayesian 
inference [3]. By comparing the two analyses, we are able 
to demonstrate that methods such as modularity maxi- 
mization are optimal, in the sense that no other method 
will succeed where they fail. 

For the formal analysis of community structured net- 
works, we must define the particular network or networks 
we will study. In this paper we focus on the most widely 
studied model of community structure, the stochastic 
block model, although our methods could be applied to 
other models as well. The stochastic block model, in 
its simplest form, divides a network of n vertices into 
some number q of groups denoted by r = 1, ..., q and then 
places undirected edges between vertex pairs i, j with 
independent probabilities $p_{rs}$, where r, s are respectively 
the groups to which vertices i, j belong. In other words, 
the probability of an edge between two vertices in this 
model depends only on the groups in which the vertices 
fall. If the diagonal elements of the matrix of probabilities 
$p_{rs}$ are greater than the off-diagonal elements, then 
the network displays classic community structure with 
a greater density of edges within groups than between 
them. Particular instances of the stochastic block model 
are commonly used as testbeds for assessing the performance 
of community detection algorithms, especially 
in the "four groups" test [1] and the planted partition 
model 0. 

Let us first demonstrate our argument for the simplest 
possible case of a network with q = 2 groups of 
equal size $\tfrac{1}{2}n$ each and just two different probabilities $p_\mathrm{in}$ 
and $p_\mathrm{out}$ for connections within and between groups. We 
focus particularly on the case of sparse networks, those 
for which the fraction of possible edges that are present 
in the network vanishes in the limit of large n, which ap- 
pears to be representative of most networks observed in 
the real world, although our results apply in principle to 
dense networks as well. 
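
A network from this two-group model can be generated numerically. The following is a minimal sketch, assuming NumPy is available; the function name `sample_two_group_sbm`, the seed argument, and the parameter values are illustrative choices and not part of the original analysis.

```python
import numpy as np

def sample_two_group_sbm(n, c_in, c_out, seed=0):
    """Sample a two-group stochastic block model with equal group sizes.

    Returns (A, labels): A is the symmetric 0/1 adjacency matrix and
    labels[i] is +1 or -1 according to the group of vertex i.
    """
    if n % 2 != 0:
        raise ValueError("n must be even so the two groups have equal size")
    rng = np.random.default_rng(seed)
    labels = np.repeat([1, -1], n // 2)
    # Edge probability for each pair: c_in/n within groups, c_out/n between.
    same_group = labels[:, None] == labels[None, :]
    prob = np.where(same_group, c_in / n, c_out / n)
    # Draw each pair i < j once, then mirror to make the matrix symmetric.
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    A = (upper | upper.T).astype(int)
    return A, labels

# Illustrative parameters: mean degree (c_in + c_out) / 2 = 20.
A, labels = sample_two_group_sbm(n=1000, c_in=35.0, c_out=5.0)
print("mean degree:", A.sum(axis=1).mean())
```

The dense n x n representation is chosen for clarity; for large sparse networks an edge-list sampler would be more economical.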

The adjacency matrix A of an undirected network is 
the n × n symmetric matrix with elements $A_{ij} = 1$ if 
vertices i and j are connected by an edge and 0 otherwise. 
If we average the adjacency matrix over the ensemble 
of our stochastic block model, the resulting matrix $\langle A \rangle$ 
has elements equal to $p_\mathrm{in}$ for vertices in the same group 
and $p_\mathrm{out}$ for vertices in different groups. Defining 
$c_\mathrm{in} = n p_\mathrm{in}$ and $c_\mathrm{out} = n p_\mathrm{out}$, this matrix can be written in the 
form 

$$\langle A \rangle = \tfrac{1}{2}(c_\mathrm{in} + c_\mathrm{out})\, \mathbf{1}\mathbf{1}^T + \tfrac{1}{2}(c_\mathrm{in} - c_\mathrm{out})\, \mathbf{u}\mathbf{u}^T, \tag{1}$$

where $\mathbf{1}$ and $\mathbf{u}$ are the unit vectors $\mathbf{1} = (1, 1, 1, \ldots)/\sqrt{n}$ 
and $\mathbf{u} = (1, 1, \ldots, -1, -1, \ldots)/\sqrt{n}$, the ±1 elements in 
the latter denoting the members of the two communities. 
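
The decomposition in Eq. (1) can be checked element by element on a small example. The sketch below assumes NumPy and uses illustrative values for n and the two degree parameters; as in Eq. (1), the diagonal is treated like any other same-group pair.

```python
import numpy as np

n, c_in, c_out = 8, 6.0, 2.0          # illustrative values only
p_in, p_out = c_in / n, c_out / n

one = np.ones(n) / np.sqrt(n)                      # uniform unit vector
u = np.repeat([1.0, -1.0], n // 2) / np.sqrt(n)    # community unit vector

# Right-hand side of Eq. (1).
A_mean = (0.5 * (c_in + c_out) * np.outer(one, one)
          + 0.5 * (c_in - c_out) * np.outer(u, u))

# Direct construction: p_in inside the diagonal blocks, p_out outside.
same_group = np.outer(u, u) > 0
A_direct = np.where(same_group, p_in, p_out)

print(np.allclose(A_mean, A_direct))
```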

Now the full adjacency matrix can be written in the 
form $A = \langle A \rangle + X$, where the matrix X is the deviation 
between the adjacency matrix and its average value. By 
definition, X is a symmetric random matrix with independent 
elements of mean zero. 

Our analysis will focus on the spectrum of eigenvalues z 
of the adjacency matrix, which we calculate in several 
steps. We start by calculating the spectral density $\rho(z)$ 
of the matrix X alone, whose average value in the random 
ensemble can be written in terms of the imaginary part 
of the Stieltjes transform: 

$$\rho(z) = -\frac{1}{\pi} \operatorname{Im} \bigl\langle \operatorname{Tr} (zI - X)^{-1} \bigr\rangle, \tag{2}$$



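where $\langle \cdots \rangle$ indicates the ensemble average. The average 
of the trace can be expanded in powers of X as 

$$\bigl\langle \operatorname{Tr} (zI - X)^{-1} \bigr\rangle = \sum_{k=0}^{\infty} \frac{\langle \operatorname{Tr} X^k \rangle}{z^{k+1}}, \tag{3}$$

where the individual terms take the form 

$$\langle \operatorname{Tr} X^k \rangle = \sum_{i_1 \ldots i_k} \langle X_{i_1 i_2} X_{i_2 i_3} \cdots X_{i_k i_1} \rangle. \tag{4}$$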



Since the elements of X have mean zero, any term in this 
sum that contains any variable just once will average to 
zero. Moreover, terms containing any variable more than 
twice become negligible when the average degree of the 
network is much greater than one, so that the only terms 
remaining are those for which k is even and which contain 
each variable exactly twice. Geometrically, the sequence 
of indices in these terms takes the form of an Euler tour of 
a rooted plane tree, with a factor of $\langle X_{ij}^2 \rangle$ on each edge, 
whose average value is $\tfrac{1}{2}(p_\mathrm{in} + p_\mathrm{out}) = (c_\mathrm{in} + c_\mathrm{out})/2n$. 
Writing k = 2m with m integer, there are $n^{m+1}$ ways to 
choose the m + 1 vertices of the tree and the number of 
topologically distinct rooted plane trees with this many 
vertices is equal to the Catalan number $C_m$. Thus 



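$$\langle \operatorname{Tr} X^{2m} \rangle = n^{m+1} \left( \frac{c_\mathrm{in} + c_\mathrm{out}}{2n} \right)^{m} C_m = n \bigl[ \tfrac{1}{2}(c_\mathrm{in} + c_\mathrm{out}) \bigr]^{m} C_m. \tag{5}$$

Combining this result with Eq. (3), we have 

$$\bigl\langle \operatorname{Tr} (zI - X)^{-1} \bigr\rangle = \frac{n}{z} \sum_{m=0}^{\infty} \frac{\bigl[ \tfrac{1}{2}(c_\mathrm{in} + c_\mathrm{out}) \bigr]^{m} C_m}{z^{2m}} = n\, \frac{z - \sqrt{z^2 - 2(c_\mathrm{in} + c_\mathrm{out})}}{c_\mathrm{in} + c_\mathrm{out}}. \tag{6}$$

Then the spectral density, Eq. (2), is 

$$\rho(z) = \frac{n}{\pi}\, \frac{\sqrt{2(c_\mathrm{in} + c_\mathrm{out}) - z^2}}{c_\mathrm{in} + c_\mathrm{out}}, \tag{7}$$
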



which is a modified form of the classic Wigner semicircle 
law for random matrices. Note that the density of 
eigenvalues increases with n, which implies that the 
fluctuations in the values vanish as $n \to \infty$. 
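
Two consistency checks on Eqs. (6) and (7) can be run numerically: the density should integrate to n, the total number of eigenvalues, and the Catalan series should agree with its closed form for z outside the band. The sketch below assumes NumPy; the function names, the evaluation point z = 12, and the parameter values are illustrative assumptions.

```python
import numpy as np

def semicircle_density(z, n, c_in, c_out):
    """Eq. (7): density of eigenvalues of X, set to zero outside the band."""
    a = c_in + c_out
    z = np.asarray(z, dtype=float)
    inside = np.clip(2.0 * a - z**2, 0.0, None)
    return (n / np.pi) * np.sqrt(inside) / a

def resolvent_series(z, n, c_in, c_out, terms=200):
    """Partial sum of the Catalan series in Eq. (6), for z beyond the band edge."""
    x = 0.5 * (c_in + c_out) / z**2
    term, total = 1.0, 0.0            # term holds C_m * x**m, starting at m = 0
    for m in range(terms):
        total += term
        term *= x * 2.0 * (2 * m + 1) / (m + 2)   # C_{m+1} = C_m * 2(2m+1)/(m+2)
    return n / z * total

def resolvent_closed_form(z, n, c_in, c_out):
    """Closed form on the right of Eq. (6)."""
    a = c_in + c_out
    return n * (z - np.sqrt(z**2 - 2.0 * a)) / a

n, c_in, c_out = 1000, 35.0, 5.0      # illustrative values only
edge = np.sqrt(2.0 * (c_in + c_out))
grid = np.linspace(-edge, edge, 200001)
count = np.sum(semicircle_density(grid, n, c_in, c_out)) * (grid[1] - grid[0])
series = resolvent_series(12.0, n, c_in, c_out)
closed = resolvent_closed_form(12.0, n, c_in, c_out)
print("integrated density:", count)
print("series vs closed form:", series, closed)
```

The series converges only when z lies outside the band, which is why the comparison is made at a point beyond the band edge.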

Armed with this result we can now calculate the spectrum 
of the adjacency matrix $A = \langle A \rangle + X$, but again we 
take the calculation in stages, starting with the simpler 
exercise of calculating the spectrum of the matrix 

$$B = \tfrac{1}{2}(c_\mathrm{in} - c_\mathrm{out})\, \mathbf{u}\mathbf{u}^T + X = A - \tfrac{1}{2}(c_\mathrm{in} + c_\mathrm{out})\, \mathbf{1}\mathbf{1}^T. \tag{8}$$

Note that $\tfrac{1}{2}(c_\mathrm{in} + c_\mathrm{out})\, \mathbf{1}\mathbf{1}^T$ is the uniform matrix with 
all elements equal to $\tfrac{1}{2}(p_\mathrm{in} + p_\mathrm{out})$, which is the average 
probability p of an edge in the entire network. Hence the 
elements of B are $B_{ij} = A_{ij} - p$. This matrix is of interest 
in its own right. It is the so-called modularity matrix, 
which forms the basis for the modularity maximization 
method of community detection. The modularity matrix 
is usually defined by $B_{ij} = A_{ij} - P_{ij}$, where $P_{ij}$ is the 
expected value of the adjacency matrix element in a null 
model containing no community structure. The most 
commonly used null model is the configuration model, 
a random graph with specified degree distribution, but 
in the present case, for which all vertices have the same 
expected degree, the null model is just a standard 
Erdős–Rényi random graph with $P_{ij} = p$ for all i, j, leading to 
the definition in Eq. (8). Thus our calculation will in this 
case give us also the spectrum of the modularity matrix. 

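The general form of the matrix B is that of a rank-1 
matrix $\mathbf{u}\mathbf{u}^T$ plus a random perturbation, a form that has 
been studied in the mathematical literature. Following 
an argument of [5, 6], let z be an eigenvalue of this matrix 
and $\mathbf{v}$ be the corresponding normalized eigenvector, so 
that 

$$\bigl[ \tfrac{1}{2}(c_\mathrm{in} - c_\mathrm{out})\, \mathbf{u}\mathbf{u}^T + X \bigr] \mathbf{v} = z \mathbf{v}. \tag{9}$$

A rearrangement gives $(zI - X)\mathbf{v} = \tfrac{1}{2}(c_\mathrm{in} - c_\mathrm{out})\, \mathbf{u}\mathbf{u}^T \mathbf{v}$, 
where I is the identity. Multiplying by $\mathbf{u}^T (zI - X)^{-1}$ and 
cancelling a factor of $\mathbf{u}^T \mathbf{v}$, we find that 

$$\frac{2}{c_\mathrm{in} - c_\mathrm{out}} = \mathbf{u}^T (zI - X)^{-1} \mathbf{u} = \sum_{i=1}^{n} \frac{(\mathbf{u}^T \mathbf{x}_i)^2}{z - \lambda_i}, \tag{10}$$

where $\lambda_i$ is the ith eigenvalue of X and $\mathbf{x}_i$ is the 
corresponding eigenvector. 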

The solutions of this equation, which give the eigenvalues 
z of the modularity matrix, are represented graphically 
in Fig. 1a. The right-hand side of the equation has 
poles at $z = \lambda_i$ for all i and, as the figure shows, this 
means that the eigenvalues must satisfy 
$z_1 > \lambda_1 > z_2 > \lambda_2 > \ldots > z_n > \lambda_n$, where both sets of eigenvalues are 
numbered in order from largest to smallest. These 
inequalities place bounds on the eigenvalues $z_2 \ldots z_n$ that 
become tight as $n \to \infty$, meaning that the spectrum of 
the modularity matrix is asymptotically identical to that 
of the random matrix X. 
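
The interlacing of the two sets of eigenvalues, and the relation in Eq. (10) for the leading eigenvalue, can be illustrated on a small example. In this sketch a Gaussian symmetric matrix stands in for X (an assumption made for simplicity, not the block-model deviation matrix itself), and `theta` plays the role of $\tfrac{1}{2}(c_\mathrm{in} - c_\mathrm{out})$; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 50, 3.0            # illustrative size and perturbation strength

# Symmetric random matrix with mean-zero entries, standing in for X.
G = rng.standard_normal((n, n))
X = (G + G.T) / np.sqrt(2.0 * n)

u = np.repeat([1.0, -1.0], n // 2) / np.sqrt(n)
B = theta * np.outer(u, u) + X          # rank-1 matrix plus random perturbation

lam, x = np.linalg.eigh(X)    # ascending eigenvalues, eigenvectors in columns
z = np.linalg.eigvalsh(B)     # ascending eigenvalues of B

# Number both sets from largest to smallest, as in the text.
lam_desc, z_desc = lam[::-1], z[::-1]
tol = 1e-10
interlaced = bool(np.all(z_desc >= lam_desc - tol)
                  and np.all(z_desc[1:] <= lam_desc[:-1] + tol))
print("interlacing holds:", interlaced)

# Right-hand side of Eq. (10) at the leading eigenvalue; compare with 1/theta.
rhs = np.sum((x.T @ u) ** 2 / (z_desc[0] - lam))
print("Eq. (10):", 1.0 / theta, "vs", rhs)
```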

FIG. 1: (a) The solid curve represents the right-hand side of 
Eq. (10) while the dashed horizontal line represents the left-hand 
side. The points at which the two cross, indicated by the 
dots, are the solutions $z_i$ of the equation, which necessarily fall 
between the eigenvalues $\lambda_i$ of the matrix X (vertical dashed 
lines). (b) The spectrum of the modularity matrix is the same 
as that of the random matrix X (semicircle), except for the 
highest eigenvalue $z_1$, which is separate from the rest of the 
spectrum above the transition point given in Eq. (14). 

The only exception is the highest eigenvalue $z_1$, which 
is bounded below by $\lambda_1$ but unbounded above. To calculate 
this eigenvalue we note that since X is a random matrix, 
its eigenvectors are also random, so that cross-terms 
cancel in the quantity $(\mathbf{u}^T \mathbf{x}_i)^2$ and the average value is 
simply $|\mathbf{x}_i|^2/n = 1/n$. Taking the average of Eq. (10) over 
the random matrix ensemble in the limit of large n then 
gives 



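$$\frac{2}{c_\mathrm{in} - c_\mathrm{out}} = \frac{1}{n} \Bigl\langle \sum_{i=1}^{n} \frac{1}{z - \lambda_i} \Bigr\rangle = \frac{1}{n} \bigl\langle \operatorname{Tr} (zI - X)^{-1} \bigr\rangle = \frac{z - \sqrt{z^2 - 2(c_\mathrm{in} + c_\mathrm{out})}}{c_\mathrm{in} + c_\mathrm{out}}, \tag{11}$$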



where we have used Eq. (6). Rearranging for z, we get 
our expression for the leading eigenvalue $z_1$: 

$$z_1 = \tfrac{1}{2}(c_\mathrm{in} - c_\mathrm{out}) + \frac{c_\mathrm{in} + c_\mathrm{out}}{c_\mathrm{in} - c_\mathrm{out}}. \tag{12}$$
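
As a quick check of the rearrangement, the value from Eq. (12) can be substituted back into Eq. (11). The sketch below uses only the Python standard library; the function names and the parameter values are illustrative assumptions.

```python
import math

def leading_eigenvalue(c_in, c_out):
    """Eq. (12): leading eigenvalue z_1 of the modularity matrix."""
    d = c_in - c_out
    return 0.5 * d + (c_in + c_out) / d

def rhs_eq11(z, c_in, c_out):
    """Right-hand side of Eq. (11); requires z at or beyond the band edge."""
    a = c_in + c_out
    return (z - math.sqrt(z * z - 2.0 * a)) / a

c_in, c_out = 35.0, 5.0          # illustrative values only
z1 = leading_eigenvalue(c_in, c_out)
print("z_1 =", z1)
print("Eq. (11) left side: ", 2.0 / (c_in - c_out))
print("Eq. (11) right side:", rhs_eq11(z1, c_in, c_out))
```

The two sides agree only when $(c_\mathrm{in} - c_\mathrm{out})^2 \ge 2(c_\mathrm{in} + c_\mathrm{out})$; for smaller differences the squaring step used in the rearrangement introduces a root that does not satisfy Eq. (11).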



We can use the same method to deduce the spectrum 
of the full adjacency matrix also. From Eq. (8) 
we see that the adjacency matrix takes the form 
$A = \tfrac{1}{2}(c_\mathrm{in} + c_\mathrm{out})\, \mathbf{1}\mathbf{1}^T + B$, which is again a rank-1 matrix 
plus a random perturbation. By the same argument as 
before, we can show that this matrix has all eigenvalues 
the same (to within tight bounds) as those of the 
modularity matrix, except again for the leading eigenvalue, 
whose value can be calculated from a relation of 
the form (10). The end result is that the lower n − 2 
eigenvalues of the adjacency matrix have the same spectrum 
as the random matrix X and the top two have the 
values $z_1$, Eq. (12), and 

$$z_2 = \tfrac{1}{2}(c_\mathrm{in} + c_\mathrm{out}) + 1. \tag{13}$$
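
The algebra leading to Eq. (13) is not written out in the text. A sketch of one route, assuming that the resolvent of B in the direction of $\mathbf{1}$ can be replaced by the same averaged trace that appears in Eq. (11) (an approximation, since B differs from X by a rank-1 term), is:

```latex
\begin{align*}
\frac{2}{c_\mathrm{in} + c_\mathrm{out}}
  &= \mathbf{1}^T (zI - B)^{-1} \mathbf{1}
   \simeq \frac{z - \sqrt{z^2 - 2(c_\mathrm{in} + c_\mathrm{out})}}{c_\mathrm{in} + c_\mathrm{out}}, \\
z - 2 &= \sqrt{z^2 - 2(c_\mathrm{in} + c_\mathrm{out})}, \\
z^2 - 4z + 4 &= z^2 - 2(c_\mathrm{in} + c_\mathrm{out}), \\
z &= \tfrac{1}{2}(c_\mathrm{in} + c_\mathrm{out}) + 1 .
\end{align*}
```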



With this result, we now have the complete spectrum for 
both the adjacency matrix and the modularity matrix. 

Let us focus on the modularity matrix. The spectrum 
is depicted in Fig. 1b and consists of the continuous 
semicircular band of eigenvalues, Eq. (7), plus the single 
eigenvalue $z_1$, Eq. (12). If the network contained no 
community structure, then $z_1$ would not be separated from the 
continuous band as it is here. So long as it is well separated 
the spectrum shows clear evidence of the existence 
of community structure and one can reasonably say that 
a calculation of the spectrum constitutes positive 
"detection" of that structure. Moreover, the signs of the 
elements of the leading eigenvector provide a good guide 
to the community division of the network, and indeed 
this particular method for community identification can 
be derived directly as a spectral version of the standard 
method of modularity maximization [2]. If, however, the 
position of the leading eigenvalue passes the edge of the 
continuous band, the spectrum no longer shows evidence 
of community structure and spectral algorithms based 
on the corresponding eigenvector will fail. One might 
imagine that this point would arrive when $c_\mathrm{in} = c_\mathrm{out}$, 
which is the point at which the network contains no 
community structure at all, but this is not the case. From 
Eq. (7) we see that the end of the continuous band falls 
at $z = \sqrt{2(c_\mathrm{in} + c_\mathrm{out})}$ and, setting $z_1$ from Eq. (12) equal 
to this value, we find that we lose the ability to detect 
community structure at an earlier point, when 

$$c_\mathrm{in} - c_\mathrm{out} = \sqrt{2(c_\mathrm{in} + c_\mathrm{out})}. \tag{14}$$



This value sets a detectability threshold beyond which 
the communities are present but cannot be detected. For 
$c_\mathrm{in} - c_\mathrm{out}$ smaller than this value, but greater than zero, 
community structure is present in the network in the 
sense that the average probability of edges within groups 
is measurably higher than that between groups, but we 
nonetheless fail to find the communities using our 
spectral method. One can generalize the calculation to 
networks with a larger number q of communities and we find 
that a similar transition happens at the point 

$$c_\mathrm{in} - c_\mathrm{out} = \sqrt{q \bigl[ c_\mathrm{in} + (q - 1) c_\mathrm{out} \bigr]}. \tag{15}$$
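
The threshold condition is simple to evaluate for given parameters. A minimal sketch using only the standard library follows; the function names and the two example parameter sets (both with mean degree 20) are illustrative assumptions.

```python
import math

def threshold_difference(c_in, c_out, q=2):
    """Right-hand side of Eq. (15); q = 2 recovers Eq. (14)."""
    return math.sqrt(q * (c_in + (q - 1) * c_out))

def is_detectable(c_in, c_out, q=2):
    """True if c_in - c_out exceeds the threshold of Eq. (15)."""
    return (c_in - c_out) > threshold_difference(c_in, c_out, q)

# Two illustrative two-group networks, both with mean degree 20.
for c_in, c_out in [(35.0, 5.0), (22.0, 18.0)]:
    print(c_in, c_out,
          "threshold:", threshold_difference(c_in, c_out),
          "detectable:", is_detectable(c_in, c_out))
```

In the second example the within-group density exceeds the between-group density, yet the difference falls below the threshold, which is the regime described above in which structure is present but not detected.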



The existence of a transition of this kind, though not 
its precise location, was demonstrated previously using 
different methods by Reichardt and Leone [1] and there 
are also close connections between our calculation and 
the theory of disordered systems. 

One might imagine this transition to be a particular 
property of the spectral method we have considered. 
Perhaps a different modularity maximization algorithm, one 
not based on spectral techniques, or a different type of 
community detection method altogether, would be able 
to get past this detectability threshold. This, however, is 
also not the case. 

In recent work, Decelle et al. [3] have used argu- 
ments based on the cavity method of statistical physics 
to demonstrate the existence of a transition akin to the 
one above in another community detection method, a 
Bayesian maximum-likelihood method based on directly 
fitting the stochastic block model to a network. More- 
over, their transition falls at the same position as that of 
Eq. (15). The importance of this result stems from the 
fact that if we know the model from which a network is 
drawn, then fitting directly to that model is provably the 
optimal way of recovering the parameters of the model 
used to generate the network — including, in this case, 
the community structure. Thus, as Decelle et al. have 
pointed out, their maximum-likelihood method is an op- 
timal method in the sense that no method can detect 
communities in the regime where their method fails. Un- 
fortunately fitting to the stochastic block model turns 
out to be a poor method of community detection for real- 
world networks 0, [13] , but its optimality in the present 
case is a useful result nonetheless. It implies, given that 
the detectability transition falls in the same place as 
for the spectral modularity method, that the modular- 
ity method is also optimal in the same sense: no other 
method will detect communities in the network when the 
modularity method does not. 

We can take these calculations further. For instance, 
we can calculate the expected fraction of vertices 
classified correctly by the spectral algorithm. We can show 
that the elements of the leading eigenvector $\mathbf{v}$ of the 
modularity matrix are equal to $\pm\alpha/\sqrt{n}$ plus Gaussian 
perturbations with variance $(1 - \alpha^2)/n$, where 

$$\alpha^2 = \frac{(c_\mathrm{in} - c_\mathrm{out})^2 - 2(c_\mathrm{in} + c_\mathrm{out})}{(c_\mathrm{in} - c_\mathrm{out})^2}. \tag{16}$$

Then the fraction of elements that retain the correct sign 
and hence give correct classifications of the corresponding 
vertices is $\tfrac{1}{2} \bigl[ 1 + \operatorname{erf} \sqrt{\alpha^2 / 2(1 - \alpha^2)} \bigr]$, where erf x is the 
Gaussian error function. 
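
The predicted fraction can be evaluated directly from Eq. (16). The sketch below uses only the standard library; the function names are illustrative, and setting $\alpha^2$ to zero below the threshold (so that the fraction is one half, i.e. chance) is an assumption made here to extend the formula into the regime where the method fails.

```python
import math

def overlap_squared(c_in, c_out):
    """Eq. (16): alpha^2, set to zero below the detectability threshold."""
    d2 = (c_in - c_out) ** 2
    if d2 == 0.0:
        return 0.0
    return max(0.0, (d2 - 2.0 * (c_in + c_out)) / d2)

def fraction_correct(c_in, c_out):
    """Predicted fraction of vertices classified correctly."""
    a2 = overlap_squared(c_in, c_out)
    if a2 >= 1.0:
        return 1.0
    return 0.5 * (1.0 + math.erf(math.sqrt(a2 / (2.0 * (1.0 - a2)))))

# Illustrative sweep at fixed mean degree 20 (c_in + c_out = 40).
for diff in [4.0, 10.0, 20.0, 30.0]:
    c_in, c_out = 20.0 + diff / 2.0, 20.0 - diff / 2.0
    print(diff, fraction_correct(c_in, c_out))
```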

Figure 2 shows a plot of this quantity as a function of 
$c_\mathrm{in} - c_\mathrm{out}$ for networks with several different values of the 
average degree, along with results for the same quantity 
from actual applications of the spectral modularity 
algorithm to networks generated using the stochastic block 
model. As the figure shows, the agreement between the 
two is excellent, except in the immediate vicinity of the 
phase transition, where finite-size effects produce some 
rounding of the threshold. The fraction of correctly 
classified vertices (minus $\tfrac{1}{2}$) plays the role of an order 
parameter for the detectability transition. Since it is continuous 
at the transition point, we have a continuous phase 
transition. 
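
A numerical experiment of this kind can be sketched in a few lines. The code below assumes NumPy, uses a much smaller network than the 100 000 vertices of Fig. 2, and relies on a dense eigendecomposition for simplicity; the function name, seed, and parameter values are illustrative and this is not the authors' own implementation.

```python
import numpy as np

def spectral_bisection_accuracy(n, c_in, c_out, seed=0):
    """Sample a two-group block model (n even) and classify vertices by the
    signs of the leading eigenvector of the modularity matrix, Eq. (8).

    Returns (fraction of vertices classified correctly, leading eigenvalue).
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat([1, -1], n // 2)
    same_group = labels[:, None] == labels[None, :]
    prob = np.where(same_group, c_in / n, c_out / n)
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    A = (upper | upper.T).astype(float)

    # Modularity matrix: subtract the average edge probability p.
    p = 0.5 * (c_in + c_out) / n
    B = A - p

    # eigh returns ascending eigenvalues, so the last column is the leading eigenvector.
    eigenvalues, eigenvectors = np.linalg.eigh(B)
    guess = np.where(eigenvectors[:, -1] >= 0, 1, -1)

    # The overall sign of an eigenvector is arbitrary, so score both labelings.
    agreement = np.mean(guess == labels)
    return max(agreement, 1.0 - agreement), eigenvalues[-1]

accuracy, z1 = spectral_bisection_accuracy(n=1000, c_in=35.0, c_out=5.0)
print("fraction correct:", accuracy)
print("leading eigenvalue:", z1, "(Eq. (12) predicts", 15.0 + 40.0 / 30.0, ")")
```

For networks of realistic size a sparse eigensolver would be the natural replacement for the dense `eigh` call used here.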

FIG. 2: The fraction of vertices correctly classified by the 
spectral modularity algorithm in networks generated using 
the block model studied here, as a function of $c_\mathrm{in} - c_\mathrm{out}$, for 
four different values of the average degree as indicated. Points 
are numerical measurements for networks of 100 000 vertices, 
averaged over 25 networks each; solid curves represent the 
analytic prediction. The phase transition at which the algorithm 
fails is clearly visible in each curve. 

The calculations presented here could be extended in 
a number of additional directions. For instance, the 
results given are accurate for networks with large average 
degree but for networks with smaller degree there are 
additional corrections that correspond to additional 
terms in the trace, Eq. (5). A calculation of these 
subleading terms would help to complete the picture for 
low-degree networks. Also our calculations all use the 
standard stochastic block model, and although this is the 
model most widely used for benchmark calculations and 
synthetic tests, other models have been proposed, such 
as the degree-corrected block model [lfj or more exotic 
models such as the LFR benchmark networks [l^. It 
would be useful to know if results similar to those de- 
scribed here can be derived for these more complex mod- 
els. 